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Abstract: Until recently, understanding the regulatory behavior of cells has been pursued through independent analysis of 
the transcriptome or the proteome. Based on the central dogma, it was generally assumed that there exist a direct corre- 
spondence between mRNA transcripts and generated protein expressions. However, recent studies have shown that the 
correlation between mRNA and Protein expressions can be low due to various factors such as different half lives and post 
transcription machinery. Thus, a joint analysis of the transcriptomic and proteomic data can provide useful insights that 
may not be deciphered from individual analysis of mRNA or protein expressions. This article reviews the existing major 
approaches for joint analysis of transcriptomic and proteomic data. We categorize the different approaches into eight main 
categories based on the initial algorithm and final analysis goal. We further present analogies with other domains and dis- 
cuss the existing research problems in this area. 
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1. INTRODUCTION 

One of the significant objectives of Systems Biology is to 
understand the regulation of cell behavior through interac- 
tions of various components in the regulome (regulation 
components in the cell such as mRNA, proteins, metabolites 
etc.). Two important observational categories involves (a) 
measurement of transcriptomic profiles through techniques 
such as microarray, RNA-seq etc. and (b) measurement of 
proteomic profiles through techniques such as gel electro- 
phoresis and mass spectrometry. Some of the data measure- 
ment techniques may involve destruction of the living cell 
and thus joint measurement of both transcripts and proteins 
in a single cell will not be feasible by such methods. Fur- 
thermore, some approaches may provide expression data on 
the average behavior of a collection of cells and not the ex- 
pression distribution of the cells. Thus, understanding the 
limitations and assumptions in the data measurement tech- 
niques used for measuring the transcriptomic and proteomic 
profiles is essential before conducting a joint analysis of the 
two data sources. Section 2 of this article provides a review 
of the various approaches used for measuring transcriptomic 
and proteomic profiles. 

The next step in developing a joint model of the two do- 
mains involves comprehending the differences in the expres- 
sion of the mRNAs and proteins. Studies [1-5] have shown 
that there can be poor correlation between mRNA and pro- 
tein expression data from same cells under similar condi- 
tions. Section 3 of this article discusses and provides possi- 
ble reasons for the lack of correlation between mRNA and 
Protein expressions. 
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Finally, the type of extracted transcriptomic and proteo- 
mic data and the ultimate goal of analysis will dictate the 
manner of the joint analysis of the two domains. Section 4 
discusses the various approaches for the integrated analysis 
of transcriptomic and proteomic profiles. We divide the var- 
ious available approaches into eight main categories and 
provide an initial overview of the techniques followed by 
specific examples in section 5. 

Section 6 provides analogies between biological scenari- 
os and other physical scenarios so that approaches used for 
the analysis of one can throw insights and be possibly used 
for the analysis of the other. We compare the gene- 
transcriptome-proteome network with an organizational 
command structure and large scale social network. 

Section 7 provides conclusions and future research direc- 
tions. 

The current review focuses on uncovering the primary 
categories of approaches that have been proposed for fusion 
of transcriptomic and proteomic data. In comparison, exist- 
ing reviews on joint transcriptomic and proteomic profiling 
focuses on specific aspects of combined analysis. For in- 
stance, Catherine Hack [6] focuses on different statistical 
methods for correlation between transcriptomic and proteo- 
mic datasets. Cox et al. [7] reviews different methods for 
comparison of microarray and proteomic datasets along with 
clustering and merging options for these datasets. Nie et al. 
[8] focuses on attempts to develop various statistical tools 
for improving the chances of capturing a relationship be- 
tween transcriptomic and proteomic data along with different 
transformation and normalization techniques for data, effects 
on measurement errors and challenges of missing values in 
datasets. A significant part of the paper by Hecker et al. [9] 
reviews approaches to build dynamic models of transcriptomic 
and/or proteomic network. Simon Rogers [10] described the 
available statistical tools for bridging multi-omics data. 
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2. TECHNIQUES FOR TRANSCRIPTOMIC AND 
PROTEOMIC DATASET GENERATION 

2.1. Methods for Transcriptomic Profiling 

Current transcriptomic profiling techniques include DNA 
microarray, cDNA amplified fragment length polymorphism 
(cDNA-AFLP), expressed sequence tag (EST) sequencing, 
serial analysis of gene expression (SAGE), massive parallel 
signature sequencing (MPSS), RNA-seq etc. 

Among the above mentioned technologies, DNA micro- 
array [11] is the most widely used one. But, its application is 
dependent on the availability of complete genome sequence 
or knowledge of significant amount of transcript sequence. 
This technique has evolved from Southern blotting [12] and 
has been widely accepted as an inexpensive analog technique 
for high-throughput transcriptomic profiling. cDNA-AFLP 
[13] is a highly sensitive method which allows the detection 
of low-abundance mRNAs. Recent examples of cDNA- 
AFLP based transcriptomic studies are documented in [14] 
and [15]. EST 1 sequencing is another approach for tran- 
scriptomic profiling which has been used in a large number 
of transcriptomic studies (e.g. [16, 17]). SAGE [18] is a 
RNA-sequencing based transcriptomic profiling method that 
can be used to analyze large number of transcripts quantita- 
tively and simultaneously (e.g. [19] and [3]). MPSS [20] is 
another sequenced based approach for profiling tran- 
scriptomic data which is somewhat similar to SAGE but with 
a substantial difference in sequencing approach and with 
different approach to biochemical manipulation (e.g. [21] 
and [22]). 

The most recent technology for transcriptomic profiling is 
RNA-Seq [23] which is considered as a revolutionary tool for 
this purpose. Eukaryotic transcriptomic profiles are primarily 
analyzed with this technique and it has been already applied 
for transcriptomic analysis of several organisms including 
Saccharomyces cerevisiae, Schizosaccharomyces pombe, Ar- 
abidopsis thaliana, mouse and human cells [24-29]. 

RNA-Seq technology shows clear advantages over exist- 
ing profiling technologies in terms of amount of sequence 
coverage, revealing new transcriptomic insights, accuracy of 
defining transcription level, etc. However, existing microar- 
ray technology still remains reliable to many researchers for 
various reasons (explained in an article by Nathan Blow 
[30]). Overall comparison of existing technologies and most 
recent RNA-Seq technology can be found in recent reviews 
by Nicole Roy et al. [31] and Schirmer et al. [32]. Use of 
different transcriptomic technologies and their success on 
Amyotrophic Lateral Sclerosis study was discussed in recent 
review [33]. Also, the omics-era technologies for systems- 
level understanding of Streptomyceshas have been discussed 
in a recent review [34]. Genome-wide copy number analysis 
[35] is another area where extensive use of different tran- 
scriptomic technologies is exercised. 

2.2. Methods for Proteomic Profiling 

Current state of the art proteomic technologies include: 
2-dimensional difference gel electrophoresis (2D DIGE), 
matrix-assisted laser desorption/ionization (MALDI) imag- 
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ing mass spectrometry, electron transfer dissociation mass 
spectrometry and reverse-phase protein array. 

2D-DIGE is a form of gel-electrophoresis which can la- 
bel 3 different samples of proteins with fluorescent dyes. 
This method overcomes the limitations due to inter-gel varia- 
tion in traditional 2D gel electrophoresis technique (2D-GE) 
[36] of proteomic profiling. Despite the limitation in 2D-GE 
method, it is still a mature proteomic profiling technique 
backed by 3 decades of research. Examples of proteomic 
study using 2D-GE can be found in [37] and [38]; whereas 
[39] and [40] provide examples of using 2D-DIGE technique 
in proteomic study. A detailed comparison between these 2 
techniques can be found in the article by Marouga et al. [41]. 
MALDI imaging mass spectrometry [42] is a unique tech- 
nique for identification of biomarkers in different diseases. 
Studies of proteomics profiling using this technique include 
[43] and [44]. Mass spectrometry based quantitative proteo- 
mic analysis is another form of proteomic profiling which is 
followed by 2D-GE. Here, intensity of protein stain is meas- 
ured to find the existence and amount of protein present in a 
sample. Liquid chromatography mass spectrometry (LC-MS) 
(example studies [45] and [46]), liquid chromatography- 
tandem mass spectrometry (LC-MS/MS) (example studies 
[47] and [48]), in-gel tryptic digestion followed by liquid 
chromatography-tandem mass spectrometry (geLC-MS/MS) 
(example studies [49] and [50]) are different versions of 
mass spectrometry techniques used in proteomic profiling. 
Electron transfer dissociation (ETD) mass spectrometry [51] 
is another form of proteomic study which is a method of 
fragmenting ions in a mass spectrometer. [52] and [53] are 
examples of proteomic studies that use ETD. Reverse-phase 
protein array [54] is a protein microarray technology that has 
use in quantitative analysis of protein expressions in various 
kinds of cells including cancer cells, body fluids and tissues 
(example studies [55] and [56]). Use of several technologies 
stated above on prognosis and outcome of the treatment of 
breast tumor was discussed in a recent review paper [57]. 

3. CORRELATION BETWEEN TRANSCRIPTOMIC 
AND PROTEOMIC DATA 

Until recently, there was an implicit assumption in sys- 
tems biology literature of the existence of proportional rela- 
tionship between mRNA and protein expressions measured 
from a tissue. However, analysis of mRNA and protein ex- 
pression data from same cells under similar conditions have 
failed to show a high correlation between the two domains in 
multiple studies [1-5]. 

To analyze the differences between mRNA and protein 
expressions, we should note that factors having an impact on 
translational efficiency will have an impact on mRNA-protein 
correlation. Physical properties of the transcript have a great 
impact on translational efficiency. One example of such phys- 
ical property can be Shine-Dalgarno (SD) sequence [58, 59] in 
prokaryotic transcripts. Transcripts that have weak SD se- 
quence are translated less efficiently. The SD sequence may 
also be changed by mutation resulting in reduced translational 
efficiency [60]. Reduction in translation due to mutation in 
galE initiation codon has also been reported [60]. 

Another physical property influencing translation is the 
whole structure of the mRNA. Temperature may change the 
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conformation of mRNA and thus influence translation which 
was reported in a study for E. coli [61]. 

In numerous organisms, multiple number of codons can 
be used to translate same amino-acid which is referred to as 
'codon-bias' [62]. Codon adaptation index [63] is the meas- 
ure for codon bias. It is reported that the mRNA-protein cor- 
relation is influenced more by codon bias than by SD se- 
quence [64]. 

Number of ribosome in a transcriptional unit is called 
ribosome-density which has a major influence on efficiency 
of translation. A ribosomal density-mapping procedure to 
explore ribosome positions along translating mRNAs is de- 
scribed by Eldad et al. [65]. mRNAs that entered ribosome 
for translation (ribosome-associated mRNA) shows better 
correlation with proteins than typical mRNA expression 
[66]. Occupancy time of those mRNAs in ribosome also has 
an impact on translational efficiency which was observed in 
case of Yeast [67]. 

Variability (normalized standard deviation) of mRNA 
expression level during the cell cycle can also affect the 
mRNA-protein correlation. This effect is found in cell cycle 
data by Cho et al. [68] which was analyzed by Greenbaum et 
al. [67] and summarized as: "high variability results in high 
correlation with protein expressions". 

The average half-life of eukaryotic mRNA is reported to 
be 10-20h whereas the average half-life is 48-72h in eukary- 
otic proteins in a study in 1989 [69]. From a recent study on 
mammalian cells, it is reported that, mRNAs are 5 times less 
stable and 900 times less abundant than proteins and spanned 
a higher dynamic range [70]. In vivo half-life of a protein 
depends on its amino-terminal residue [71]. Phosphorylation, 
ubiquitination, and localization of proteins are some post- 
transcriptional factors which creates variety of half-lives in 
proteins [72]. Also variation in synthesis and degradation of 
different proteins creates varied half-lives for proteins which 
may affect the correlation of protein expression with mRNA. 
It is also reported that the correlation between half-lives of 
proteins and mRNAs can be low even when the actual ex- 
pression level correlation can be higher and also mRNA and 
protein shares functional properties if they have specific 
combination of stability ('stable mRNA stable protein', 'sta- 
ble mRNA unstable protein', 'unstable mRNA stable pro- 
tein' and 'unstable mRNA stable protein') in mammalian 
cells [70]. The above mentioned stability issues have to be 
analyzed for modeling dynamic mRNA and protein expres- 
sion inter-network. 

Finally, the experimental errors in the type of data extrac- 
tion approach for protein and mRNA expression brings in 
extrinsic noise that significantly influences the correlation 
between mRNA and protein expression. 

4. OVERVIEW OF DIFFERENT APPROACHES 

In this section, we provide a brief overview of the pro- 
posed approaches in literature to jointly analyze transcriptomic 
and proteomic data. The methods for integrating and modeling 
transcriptomic and proteomic networks can be categorized 
primarily into eight different types as discussed next: 

1. Type 1: Union of Transcriptomic and Proteomic Da- 
ta: This can be considered as one of the most obvious 
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integration types. Approaches related to this type 
generally consider a union of two different data sets 
(proteomic data and transcriptomic data; not from the 
same sample) and then create a reference data set. 
The reference data sets have sometimes shown new 
insights and revealed previously undetected phenom- 
enon or supported a new phenomenon as compared to 
the individual data-sets. There are a number of ap- 
proaches related to this [73, 74]. A work on Bradyrhi- 
zobium japonicum bacteroid metabolism in soybean 
root nodules by Nathanael et al. [49] can be an exam- 
ple of this method. In this study, authors have com- 
piled a reference dataset by combining (union) tran- 
scriptomic and proteomic data. Based on the refer- 
ence dataset, they have discovered significant number 
of enzymes related to several types of bacterial me- 
tabolism that were not present in the dataset from pro- 
teomic study alone. Section 5.1 briefly reviews the 
approach considered by Nathanael et al. 

2. Type 2: Extraction of Common Functional Context of 
Transcriptomic and Proteomic Features: For various 
reasons [72], transcriptomic and proteomic data may 
not have direct overlap in features (here feature refers 
to different genes for transcripts and proteins). But 
features on transcriptomic and proteomic level might 
share the same functional context. These functional 
contexts may refer to different biological processes or 
pathways in which features from both transcripts and 
proteins are enriched. In this approach, the common 
functional contexts are extracted through the analysis 
of both transcriptomic and proteomic datasets on the 
level of protein interaction networks. This approach 
was published in 2010 by Paul et al. [75] which is 
discussed in section 5.2. Authors of this publication 
also generated omicsNET for finding dependency be- 
tween features of proteomics and transcriptomics. 

A similar kind of approach (functional analysis) was ap- 
plied for integrating transcriptomic and proteomic evaluation 
of gentamicin nephrotoxicity in rats by Com et al. in 2011 
[39]. But the functional analysis was done by GO-Browser 2 
(an in-house Gene Ontology based annotation tool) with help 
of Ingenuity Pathway Analysis software 3 . Based on the func- 
tional analysis, some gene ontology biological processes 
were selected which were enriched by the features of the 
transcriptomic and proteomic dataset with Fisher 
P va / Me -0-05. This integration by functional analysis reveals 

a putative model of toxicity [39] in the kidney of rats. 

3. Type 3: Topological Networks Approach: Topological 
network methods (over-connection analysis, hidden 
node analysis, rank aggregation and network analysis) 
have been used to elucidate the common regulators 
(transcriptional factors and receptors) from two dif- 
ferent types of data sets (transcriptomic and proteo- 
mic) by Eleonora Piruzian et al. [37]. This category 
of approach refers to locating upstream regulators of 
mRNA and proteins individually and collecting the 



2 Gene Ontology annotation tool, 

http://www.geneontology.org/GO.tools_by_type.browser.shtml 

3 Ingenuity systems, USA, http://www.ingenuity.com/ 
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common regulators in both the networks for a com- 
bined signaling pathway. Topological and network 
analysis was used in finding individual transcription 
factors (TF) of mRNAs and Proteins. The TFs that 
were not common in transcriptomic and proteomic 
profiles were ignored and the common TFs were used 
to find the most influential receptors that could trigger 
maximal possible transcriptional response. Among 
the receptors discovered from joint analysis, some of 
them were never reported as psoriasis markers in ear- 
lier studies while some of them have been reported 
before. In another recently published study [76], an 
integrated quantitative proteomic, transcriptomic, and 
network analysis approach was discussed which also 
reveals molecular features of tumorigenesis and clini- 
cal relapse. Section 5.3 discusses the approach of 
Eleonora et al. . 

4. Type 4: Merging datasets in individual domains: 
Type 4 integration merges multiple proteomic data 
sets into a merged-proteomic data set along with join- 
ing multiple transcriptomic data sets into a reference 
transcriptomic data set. The transcriptomic and prote- 
omic datasets that are merged can be created by dif- 
ferent transcriptomic and proteomic profiling respec- 
tively. After merging the datasets, correlation analysis 
is conducted between these 2 merged data-sets and it 
is shown that the coefficient of correlation is better 
than the one without merging. Furthermore, specific 
subsets of the merged data sets can have higher coef- 
ficient of correlation. Dov Greenbaum et al. [67] used 
such an approach in their publication in 2003 which is 
discussed in section 5.4. 

5. Type 5: Missing Value Estimation by non-linear op- 
timization: This category of integration uses non- 
linear or linear optimization to predict missing values 
of proteomic data. It maximizes an objective function 
to find out the connections between transcriptomic 
and proteomic networks. However, they do not result 
in a dynamic model able to predict the abundance of 
next time point but rather, they are able to predict the 
protein expression at the same time point. A good ex- 
ample of non-linear optimization is a method de- 
scribed in Wandaliz Torres-Garcia et al. [77] for a 
study of Desulfovibrio vulgaris published in 2009. 
The method is based on stochastic gradient boosting 
tree (GBT) proposed by Friedman et al. [78]. Sto- 
chastic GBT optimization technique was also used in 
a study of Shewanella oneidensis in 2011 [79]. Artifi- 
cial neural network approach was applied to find the 
missing values of the proteins using the relations be- 
tween transcriptomic and proteomic data in a separate 
study published in 201 1 [80]. In section 5.5, we brief- 
ly review the approach made by Garcia et al. [77] in 
their Desulfovibrio vulgaris study. 

6. Type 6: Multiple regression analysis to predict con- 
tribution of sequence features in mRNA-protein cor- 
relation: Protein abundance is not only related to cor- 
responding mPvNA abundance but also depends on 
other biological and chemical factors (termed as co- 
variates). For this reason, the idea of multiple regres- 
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sion analysis is used to relate characteristics of differ- 
ent covariates of each individual gene with the 
mRNA-protein correlation. The multiple regression 
approach can possibly provide a better explanation of 
protein variability than traditional single regression 
technique. Effect of multiple sequence feature (one 
kind of covariate) on mRNA-protein correlation was 
discussed by Nie et al. in 2006 [81] where they have 
used multiple regression analysis. Example of another 
linear regression model can be Poisson's linear re- 
gression model which has been used by Lie Nie et al. 
[47] to elucidate the relationship model of tran- 
scriptomic and proteomic networks. In section 5.6, we 
briefly explain the multiple regression analysis used 
in [81]. 

7. Type 7: Clustering Approaches: Clustering mRNA 
and protein abundance datasets individually and lo- 
cating similarities (and hence correlation) between the 
individual clusters does not produce promising results 
(as explained in section 5.7). This failure leads to the 
assumption that concatenating the proteomic and 
transcriptomic datasets and then clustering the con- 
catenated dataset may not be a good idea either (de- 
tails in section 5.7). Based on these observations, a 
new clustering method called coupled clustering was 
implemented by Rogers et al. [82]. Couple clustering 
creates certain number of proteomic and tran- 
scriptomic clusters and provides the conditional prob- 
ability of a gene to be in a protein cluster given that it 
is in an mRNA cluster. These conditional probabili- 
ties can reveal the relational complexity of mRNA 
and protein data. Rogers et al. used time series tran- 
scriptomic and proteomic data extracted under same 
experimental conditions. Section 5.7 discusses cou- 
pled clustering approach. We would want to empha- 
size that this type of approach is also not a dynamic 
modeling approach that can provide temporal predic- 
tions. 

8. Type 8: Dynamic Modeling: A number of studies re- 
ported in the literature have inferred dynamic models 
( such as Boolean network, linear models, differential 
equation models, Bayesian networks etc.) of GRNs 
from time series transcriptomic data alone. For exam- 
ple, Liang et al. [83] used REVEAL algorithm for in- 
ference of Boolean network model from time series 
mRNA expression data. A basic linear modeling has 
been proposed by D'haeseleer [84]. GRN Models 
consisting of differential equations was employed by 
Guthke et al. [85]. Validation of inference procedures 
of GRN was discussed by Edward R Dougherty [86]. 
Friedman used Bayesian networks to analyze and 
model gene expression data [87]. Among the existing 
network models, Bayesian networks can be applied to 
combine heterogeneous data and prior biological 
knowledge. For example, Nairai et al. [88] used pro- 
tein-protein interaction network data for refining the 
Bayesian Network model of the GRN produced by 
mRNA data alone. Yu Zhang et al. [89] used tran- 
scriptional factor binding site data and gene expres- 
sion data (transcriptomic) to model GRN using 
Bayesian network approach. Werhli et al. [90] inte- 
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grated multiple sources of prior biological knowledge 
(TF binding location) with microarray expression data 
to generate a Bayesian network model. 

Section 5.8 discusses approach used by Nariai et al. [88]. 
Similar kind of approach was also used by Segal et al. [91] 
where they identified pathways from microarray data and p-p 
interaction data. 

5. SPECIFIC EXAMPLES FOR INTEGRATED 
ANALYSIS CATEGORIES 

In this section, we provide details and specific examples 
for the eight categories of approaches for joint analysis of 
transcriptomic and proteomic data discussed in the previous 
section. 

5.1. Type 1 Example: Integration of Omics Data Gener- 
ated Under Symbiotic Conditions 

Brady rhizobium japonicum is a gram negative, rod- 
shaped, nitrogen-fixing bacterium that communicates with 
its host plant and develops a symbiotic partnership with its 
host. The host considered in [49] is the soybean plant Gly- 
cine max 4 . The complete genome sequence of Bradyrhizobi- 
um japonicum was identified by Kaneko et al. [92] where 
8317 potential protein-coding genes were found. 66153 pro- 
tein-coding loci have been identified in the genome sequence 
of Glycine max 5 . 

Nathanael et al. [49] built a database by combining the 
above mentioned 8317 proteins of Brady rhizobium japoni- 
cum, 62199 of the above mentioned 66153 proteins of Gly- 
cine max and 258 contaminating proteins. They searched the 
combined database to locate the experimental protein ex- 
tracts of Brady rhizobium japonicum soybean bacteroids. 
GeLC-MS/MS experimental data was used for this study. A 
probability-based protein identification algorithm [93] was 
employed to identify the proteins from mass spectrometry 
data by searching the sequence database. The use of com- 
bined database was beneficial because of the fact that soy- 
bean proteins present in the nodule extracts of Bradyrhizobi- 
um japonicum and Glycine max might have symbiotic rela- 
tions. 2315 proteins in the experimental dataset were also 
reported to be present in the combined database. 

The expression of 2780 B. japonicum genes in soybean 
nodules was reported by Pessi et al. [94] in a transcriptomic 
study in 2007. The experimental condition of this genomic 
study was same as the proteomic study made by Nathanael et 
al. [49]. In both the transcriptomic and proteomic analysis, 
stringent filtering criteria for normalization procedure were 
applied. Several statistical analysis tests (Wilcoxon rank-sum 
and the student t-test with a P-value threshold of 0.01) based 
on numerous biological replicates were applied in the micro- 
array based transcriptomic study to prevent erroneous con- 
clusions. The use of transcriptomic expression profiling 
alone has some limitations (such as limitations of the array 
i.e. not all the genes are present in the array, concealment of 
true expression levels due to bias of the probe set, etc.) 



which are also somehow true for using only proteomic ex- 
pression profiling. Thus, the authors chose to integrate 
(through a simple union method) both the data sets and use 
this list of genes as reference data set for bacteroid expres- 
sion. In total, 3587 transcriptomic (A) and protein (B) ex- 
pressions in soybean bacteroids were recorded in the union 
set (AUB). The number of elements in the set Af)B was 
1508. 807 proteins were identified to be expressed by only 
the proteomics approach (B-A) and 1272 genes that have 
been identified as expressed only in the transcriptomic study 
(A-B). Among the set A-B, 47 were RNAs (45 tRNAs, 
rnpB, ssrA2) and the remaining 1225 were protein encoding 
genes Fig. ((1). Summarizes the method of integration. 



Transcriptomic 

data 
2780 genes (A) 



Proteomic 
data 
2315 genes (B) 




Reference 
dataset 



Fig. (1). Integration of transcriptomic and proteomic dataset by 
simple union method. 

A list of 15 gene function categories 6 were observed for 
the 3 different datasets: (i) the dataset X by Kaneko et al. 
[92] consisting of 8317 protein encoding genes, (ii) the ref- 
erence datasets AUB consisting of 3540 (3587-47) protein 
encoding genes and (Hi) the dataset X-(AUB) consisting of 
4777 protein encoding genes i.e. the protein encoding genes 
that are not detected by Nathanael et al. [49]. The number of 
genes present in the 3 datasets for each category was detect- 
ed. The number of genes/proteins present in each category 
was divided by the total number of genes/proteins in that 
data set and thus relative frequency of each category for the 
3 datasets was established. In 4 among the 15 categories, it 
has been observed that the relative frequency in the reference 
dataset (AUB) is less than the relative frequency in the 
Kaneko dataset (X). This means that in only 4 categories, the 
reference dataset A UB was less capable to represent that cat- 
egory than the dataset X. 



4 details of this symbiotic relation can be found at: 

http://web.mst.edu/djwesten/Bj.html 

5 http://genome.jgi.doe.gov/soybean/soybean.info.html 



6 The 15 categories are (i) Amino acid biosynthesis (ii) Biosynthesis of 
cofactors, prosthetic groups and carriers, (Hi) Cell envelope (iv) Cellular 
processes (v) Central intermediary metabolism (vi) Energy metabolism 
( vii) Fatty acid, phospholipid and sterol metabolism ( viii) Purines, 
pyrimidines, nucleosides, and nucleotides (ix) Regulatory functions (x) 
DNA replication, recombination, and repair (xi) Transcription (xii) 
Translation (xiii) Transport and binding proteins (xiv) Hypothetical and 
(xv) other categories. These were found from 

http://genome.kazusa.or.jp/rhizobase/Bradyrhizobium/genes/category. 
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The reference dataset revealed novel insights regarding 
some aspects of bacterial metabolism (e.g. Nitrogen metabo- 
lism, Carbon metabolism, Nucleic Acid metabolism) and 
also regarding translation and post-transcriptional regulation. 
For example (i) some key regulator proteins (e.g. GlnA, 
GlnB, GlnK and Glnll) of N metabolism was identified in 
the reference dataset. (ii) The authors reported that all the 
enzymes related to C4 metabolism were detected for the 1st 
time as well as almost the entire set of gluconeogenesis re- 
lated enzymes was identified in the combined reference data 
set. (Hi) In a study of global protein expression pattern of 
Brady rhizobium japonicum bacteroids by Sarma et al. [95], 
nucleic acid metabolism related proteins were reported to be 
lacking in the total protein expression pattern of the nodule 
bacteria. But, in this study, authors have found almost all 
enzymes related to de novo nucleoside and nucleotide bio- 
synthesis either in the gene or in the protein level or in both. 
(iv) The reference dataset also comprises a large number of 
proteins related to transcriptional and post-transcriptional 
regulation. Also, the enzymes related to protective response 
(to reactive oxygen species) under stress were also discov- 
ered in the reference dataset. In this context, it can be men- 
tioned that, glutathione is crucial for biotic and abiotic stress 
management for plants, thus, detecting all the enzymes re- 
quired for glutathione synthesis and reduction strengthens 
the richness of the reference dataset. Very few enzymes in 
metabolic pathways were discovered in the dataset derived 
from the proteomic study alone (dataset B). Thus the combi- 
nation of data can provide novel insights for carbon and ni- 
trogen metabolism. 

5.2. Type 2 Example: Functional Analysis of Tran- 
scriptomic and Proteomic Data 

An approach for linking transcriptomic and proteomic 
data on the level of protein interaction network has been dis- 
cussed in a recently published paper [75]. Transcriptomic 
and proteomic datasets characterizing the chronic kidney 
disease (CKD) have been used to illustrate the procedures 
for integrating omics profile at the level of protein interac- 
tion networks. 

Three publicly available studies on CKD by Schmid et 
al. [96], Baelde et al. [97] and Rudnicki et al. [98] are used 
for identifying deregulated features on the mRNA level. 697 
differentially regulated genes were selected from the 3 stud- 
ies creating the transcriptomic dataset. The proteomic dataset 
was extracted from the online database HUPDB v2.0 1 . A 
total of 192 samples were used and after comparing with the 
CKD and healthy references, 37 proteins were identified as 
differentially abundant. HUPDB was selected as the only 
source to avoid heterogeneity of datasets. Swiss-prot annota- 
tion tool was used in this study (as HUPDB uses Swiss-prot 
names as identifiers) to map the proteins to the gene sym- 
bols. Swiss-Prot [99] or UniprotKB is a protein sequence 
database which provides all known relevant information 
about a particular protein. The details about Swiss-prot entry 
annotation can be found in http://www.uniprot.org/faq/45 . 



7 HUPDB v2.() Human Urinary Proteom database 
http://mosaiques- 

diagnostics.de/diapatpcms/mosaiquescms/front_content.php?idcat=257 
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The following five different analysis procedures have 
been discussed and compared extensively in this study to 
elucidate the correspondence between transcriptomic and 
proteomic data: (i) Direct feature overlap, (ii) Functional 
overlap; here features (genes) related to different biological 
processes that are found in transcriptomic or proteomic da- 
tasets or both has been analyzed, ( Hi) Joint pathway analysis, 
(iv) Protein dependency graph analysis and (v) Direct edges 
between transcripts and proteins. 

(i) Direct feature overlap: Here 'feature' refers to different 
genes for transcripts and proteins. Features present in both 
the transcriptomic and proteomic lists were identified. Genes 
of 4 proteins 8 (out of 37) were also reported to be differen- 
tially expressed in the transcriptomic dataset. This overlap 
was confusing as only 1 of them was upregulated in both the 
datasets but the other 3 shows upregulation in one dataset 
and downregulation in another. 

(ii) Functional Overlap: PANTHER (Protein ANalysis 
THrough Evolutionary Relationships) classification system 
[100, 101] classifies proteins and their genes to facilitate 
high-throughput analysis. The classification of proteins were 
done according to 'family and subfamily', 'molecular func- 
tion', 'biological process' and 'pathway'. Multiple gene lists 
can be uploaded in the PANTHER system and jointly com- 
pared against a reference dataset to look for under and over 
represented functional categories based on either 'chi-square 
test' or 'binomial statistics' tool. Here, the authors used 
PANTHER to identify enriched biological process. They 
have used fully annotated set of human genes as a reference 
dataset and used chi-square test with a p-value less than 0.05 
to identify significantly enriched or depleted biological pro- 
cesses. They had identified 27 biological processes 9 that are 
relevant to the transcriptomic and proteomic datasets. Four 
of the processes were found to be enriched by both the tran- 
scriptomic and proteomic feature set. Other processes were 
enriched by either transcriptomic or proteomic features. 

(Hi) Joint Pathway Analysis: The laboratory of Immuno- 
pathogenesis and Bioinformatics (LIB) developed the DA- 
VID (the Database for Annotation, Visualization and Inte- 
grated Discovery) tool [102] to provide functional interpreta- 
tion of large lists of genes derived from different genomic 
studies. KEGG pathway database is used as a repository for 
applying DAVID tool in this study. Seven pathways 10 are 
found to be significantly enriched in deregulated transcripts 
and/or proteins using Fisher exact test with p-value less than 
0.05. Among these 7, 3 pathways are enriched with both the 



8 These 4 genes are COL15A1, UMOD, PTGDS and APOA1 

9 The 27 biological processes are: Protein metabolism and modification, 
Blood circulation and gas exchange, Cell structure and motility, 
Development processes, Immunity and defense, Protein modification, 
Signal truansduction, Cell structure, Cell motility, Intracellular protein 
traffic, Cell cycle, Cell adhession, Cell communication, Intracellular 
signalling cascade, Mesoderm development, Mitosis, Ectoderm 
development, Protein phosphorylation, Blood clotting, Cell proliferation 
and diferentiation, Cell cycle control, Neurogenesis, Homeostasis, 
Interferon-mediated immunity, Angiogenesis, Chromosome segregation, 
Apoptosis 

10 The 7 pathways are: Cell communication, ECM-receptor interaction, p53 
signaling pathway, Complement and coagulation cascades, Tight 
junctions, Regulation of active cytoskeleton, Focal adhesion 
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transcriptomic and proteomic features and 4 pathways are 
enriched with either transcriptomic or proteomic features. 

(iv) Protein Dependency Graph Analysis: PANTHER and 
KEGG do not cover all the features found in transcriptomic 
and proteomic dataset used in this study. So, the authors had 
developed undirected protein interaction network omicsNET 
[103] which includes all protein encoding genes as nodes in 
the network. It has edges between the nodes with edge 
weights referring to dependency measures between the pair 
of nodes. The dependency measures were determined using 
Gene Expression Omnibus Human Body Map, the Micro- 
Cosm database, Gene Ontology data on molecular processes 
and functions, PANTHER, KEGG and IntAct databases. The 
research team found 65 strong dependencies in omicsNET 
between the features of transcriptomic and proteomic dataset 
of this study. The features that are involved in the dependen- 
cy graph include 21 features from transcriptomic dataset, 21 
features from proteomic dataset and 2 features from both. 
Also, dependencies between the features related to blood 
coagulation cascade was analyzed using omicsNET on dif- 
ferent edge weight values. 

(v) Direct edges between transcripts and proteins: MAP- 
PER 11 (Multi- genome Analysis of Positions and Patterns of 
Elements of Regulation) is a platform for identifying tran- 
scription factors binding sites (TFBSs) in multiple genomes 
[104]. Binding sites of 4 transcription factors were identified 
in ORF (open reading frame) regions of the 37 proteins using 
MAPPER. 2 of these 4 transcription factors shows upregula- 
tion and 2 shows downregulation in mRNA level. At least 1 
of these 4 transcription factor binding sites are present in 13 
proteins of the protein dataset. This fact reveals some direct 
edge between transcripts and proteins. 

Use of direct feature overlap gave only limited and am- 
bivalent results. But this limited overlap increases when en- 
riched biological processes were identified based on tran- 
scriptomic and proteomic datasets. Several biological pro- 
cesses were identified significantly enriched with both tran- 
scriptomic and proteomic features. Mapping transcriptomic 
and proteomic features on different KEGG pathways also 
reveals significant involvement of both transcriptomic and 
proteomic features; three KEGG pathways are found to be 
enriched with them. And finally using omicsNET for over- 
coming the shortcomings of PANTHER and KEGG gives 
significant outcome for understanding the interactions of the 
transcriptomic and proteomic network. 

The main advantage of functional analysis of transcriptom- 
ic and proteomic data is that different pathways and processes 
for the genes under analysis become evident. Although de- 
pendency measures can be found between transcripts and pro- 
teins from omicsNET, the main shortcoming is that it cannot 
create a dynamic model involving transcripts and proteins. 

5.3. Type 3 Example: Comprehensive Meta-Analysis 

Topological network analysis was used to reveal the sim- 
ilarities and differences between transcriptomic and proteo- 
mic-level perturbations in psoriatic lesions in a recent study 
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[37]. The transcriptomic data related to psoriatic lesions con- 
tains 462 over-expressed transcripts and the proteomic data 
contains 10 abundantly expressed proteins. Unlike most of 
the studies, this study shows good consistency between the 
proteomic and transcriptomic dataset as 7 out of the 10 pro- 
tein encoding genes were also over-expressed in the tran- 
scriptomic dataset. But the significant difference in the mag- 
nitudes of the 2 dataset hinders direct correlation analysis. 
Rather than analyzing correlation between them, the authors 
used topological network approach to discover regulatory 
transcription factors, receptors and their ligands to recon- 
struct the network between them. Their approach produced 
biologically meaningful results and revealed unknown regu- 
latory receptors that may be related to psoriatic lesions. 

The methods used in the study include (i) Over- 
connection analysis (ii) Topological analysis: hidden node 
analysis (Hi) Rank aggregation and (iv) Network analysis. 

Interactome Overconnectivity Analysis: It is assumed in 
this analysis method that the expression values of tran- 
scripts and proteins follow hyper geometric distribution. 
The method for finding overconnected regulators (tran- 
scription factors) of a target dataset is described in the sup- 
plementary material of publication by Nikolsky et al. [105]. 
This overconnection analysis mainly ranks transcription 
factors (assign a score to it). The score or significance of a 
transcription factor (taken from global gene database or 
manually curated gene database like MetaCore 12 ) is a func- 
tion of 'hypergeometeric distribution probability mass 
function' [37, 105]. 

Hidden Node Analysis: The complete algorithm for hidden 
node analysis 13 has been discussed by Dezso et al. [106]. 
Here we'll try to demonstrate what it actually does by a very 
simple example. Fig. ((2) demonstrates 4 genes (nodes) x^, Xj, 

and x^ which are over-expressed or abundant in a tran- 
scriptomic or proteomic dataset. Hidden node analysis reveals 
node which was not present in the experimental data but is 
the key to regulate downstream effects of targets x^, x^ and x. 
. The members of the hidden nodes may come from a global 
database or manually curated database like MetaCore. 

Rank Aggregation: When we have multiple ordered lists, 
rank aggregation approaches can be used to combine them 
into a single list. Rank aggregation can be formulated as a 
optimization problem [107] with the objective function being 
the weighted sum of distances of the original list from the 
m 

combined list 0(L)= ^ wd(L,L) where L is the corn- 
el 

bined list and L. denote the individual lists and d is a dis- 
i 

tance function. An example of the distance function is the 
Spearman footrule distance which is the absolute sum of the 
differences in the ranks of the unique elements of the indi- 
vidual list and the combined list. The optimization can be 
carried out using approaches such as Cross-Entropy Monte 
Carlo stochastic search [108] or genetic algorithms. 



11 http://mapper.chip.org/ 



http://www.genego.com/metacore.php 

Hidden node analysis: http://www.genego.com/hidden_nodes.php 
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Network Analysis: A typical network analysis helps to 
select biologically connected meaningful sub-networks with 
relevant objects. For example, 10 over-expressed proteins 
were taken as relevant objects in this study and published 
literature was used to come up with regulatory transcription 
factors, receptors and kinases. 

For the integrated study, 10 common transcription factors 
(top ranked) of transcriptomic and proteomic data types were 
identified using overconnection analysis and hidden node 
analysis. But before this step, 20 TFs were identified for 
each data type using topology analysis and 5 TFs were iden- 
tified using network analysis approach. These 10 common 
nodes in each data type shows resemblance with the TFs 
found from literature search. Next, hidden node algorithm 
was used to find the most influential 44 membrane receptors 
that are present in the same signaling pathway with one of 
the 10 common TFs and whose genes or corresponding lig- 
ands were 2.5 fold (or greater) over expressed in the experi- 
mental data. For the hidden node analysis to find the 44 re- 
ceptors, the target set consisted of 462 differentially ex- 
pressed genes. Among the 44 receptors, 22 were previously 
reported to be related to psoriatic lesions. 14 of them shows 
possible but not confirmed relation to psoriatic lesions and 
rest of the 8 receptors were not reported before. 

Thus, topological analysis can be applied to identify 
common regulators from two different datasets. The com- 
mon regulatory machinery can be applied to arrive at a bio- 
logically meaningful signaling pathway that can be verified 
through new experiments or existing reported results in liter- 
ature. 




Fig. (2). Hidden node analysis reveals new node X5. The dotted line 
represents the connectivity before hidden node analysis and the 
solid line reresents the connectivity after hidden node analysis. 

5.4. Type 4 Example: Merging Datasets into Meta- 
Datasets 

In a review paper published in 2003, Dov Greenbaum et 
al. [67] discussed the results of the comparisons of different 
approaches on correlating mRNA expression with protein 
abundance. The authors focused on yeast. 

At first, they created a 'reference mRNA dataset' using 
iterative combination of different datasets (as shown in Al- 



gorithm 1) that had been earlier discussed by Greenbaum et 
al. [38] in 2002. They used 4 different datasets from 4 dif- 
ferent studies: 3 of the datasets [109-111] used Affymetrix 
chips and the remaining dataset [112] used the SAGE meth- 
od. The 4 datasets were merged using algorithm 1 to create 
the 'mRNA reference dataset'. 

As a following step, four proteomic datasets were from 
Gygi et al. (2DE-1 dataset [3]), Futcher et al. (2DE-2 dataset 
[113]), Washburn etal. (MudPit-1 dataset [1 14]) and Peng et 
al. (MudPit-2 dataset [115]) were merged to create a refer- 
ence protein dataset. The merging technique is illustrated in 
algorithm 2. 

The correlation coefficient (r) of the reference mRNA 
dataset and the reference protein dataset was derived to be 
r=0.66. Correlation for some smaller functional categories of 
proteins was also calculated and a mix of higher correlation 
values 14 as well as lower correlation values 15 was reported. 

The ideas of merging mRNA datasets (Algorithm 1) and 
merging proteins (Algorithm 2) have some sort of similarity. 
The mRNA merging technique uses one of the mRNA da- 
tasets it is merging as a reference that is used to find the re- 
gression parameters while the protein merging technique 
uses the reference mRNA dataset in this regard. The values 
of a and (3 is an important factor in mRNA merging tech- 
nique while the quality ranking of the protein datasets have 
important influence on merging the protein datasets. The 
order 2DE-\>2DE-2>MudPit-2>MudPit-\ is used as qual- 
ity ranking that depends on the confidence level of the accu- 
racy of the datasets. 

Algorithm 1: Algorithm for generating the mRNA reference 
dataset 

a) Let X^-.X define different mRNA expression da- 
tasets from n different experiments using Gene Chips. 
Let the last dataset X^ denote the dataset with highest ac- 
curacy. Let X n be denoted by X^. The dimension of the- 
se datasets may not be equal i.e. each dataset does not 
contain mRNA expression of all N genes that are present 
jointly in the n datasets. Here XjJ) will denote the mRNA 

expression of the ith gene in the jth database /6l,2,..JV 
and j'£l,2,...«. 

b) Initialize MergedData=[] 

c) Set a=0. 15. This can be changed. 

d) forj=l:n-l 

( i) . Find the common genes present in X. and X^ 

(ii) . Find the parameter p^ and p^ while minimizing the 



r , =0.8, r „ . , =0.74, r =0.71 

cucieolus cetlperipnery cellcycle 



r . , J . =0.42, r „ =0.45 

mitochondria cellrescue 
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expression ^ (p^^(k)-X^(k))^ where k is the 

member of common gene set in X. and 

(in). fori=l:N 

if gene i is present in both X. and X^ and 

Pl /. 2 (i)-X H (i) 



■<a thenMergedData(i) 



p^hyx^i) 

els eif gene i is present only in Xj then 

MergedData(i)=Xj(i) 
els eif gene i is present only in X^then 

MergedData(i)=Xj_^i) 

elseif gene ;' is present neither in X. nor in X^then 

Do nothing. mRNA expression for ith gene will not 
be present in the MergedData 

endif 

endfor 

( iv). Make the MergedData as the X^ for next iteration 
endfor 

e) Let S denote the SAGE data. 

f) Set (3=16. 

g) fori=l:N 

if gene i is present in both MergedData and 5 and 
MergedData(i)>fi and MergedData(i)<S(i) then 
MergedData(i)=S(i) 

endif 

endfor 

Algorithm 2: Algorithm for creating Protein reference dataset 

(a) Let Pp P2-.-.P n Denote protein expression datasets from 

n different experiments. Let the last dataset P n denote the 

dataset with the highest confidence level in its accuracy. 
The dimension of these datasets may not be equal i.e. 
each dataset does not contain protein expression of all N 
genes that are present jointly in the n datasets. Here P.{i) 

will denote the protein expression of the ith gene in the 
jth database. z'El,2,..JV and j£l,2,...n 

(b) Let M denote the mRNA reference dataset created using 

algorithm 1 

(c) Find the parameters a^ and bj while minimizing the ex- 



pression ^ {aM j(k)-Pj(k)) where k is the member of 
k 

the common gene set in P. and M and j£l,2,....n. 

(d) Find Y-a.M* 3 ) for all j£l,2,...«. This is the transfor- 

mation of the protein databases into mRNA reference 
dataset. 

Y b f 

(e) Find Pj = a (— ) 1 for all jEl, 2,...n. This is the in- 

" a J 

verse transform into the protein space using parame- 
ters of the most accurate set P . 

n 



(f) Combine the set P 1 ,P 2 ,...P into MergedData. To do 

this, order of confidence for the original datasets will be 
used i.e. if expression for gene ;' is present in multiple da- 
tasets, then the value of that gene in MergedData will 
come from the dataset with the highest confidence. 



The authors' argument in support to their proposed pro- 
tein merging technique is that the resulting protein reference 
dataset is a better quantitative and a representative one that is 
easier to compare with the mRNA expression dataset and is 
supported by the increased correlation coefficient in some 
functional categories. Thoughtful scaling techniques were 
applied to avoid biases in the datasets and in case of multiple 
possibilities of entering values into the reference dataset 
from individual datasets, a plausible method of quality rank- 
ing was used. So, the ultimate success can be viewed as find- 
ing higher correlations among different functional catego- 
ries. The lower correlation in a functional category reflects 
heterogeneity in that category. 

We should note that often due to different half-lives of 
mRNAs and Proteins, we will observe lack of correlation 
between transcriptomic and proteomic datasets measured at 
the same time under symbiotic conditions. Merging of mul- 
tiple datasets normalizes and integrates the expression values 
which are more likely to have a higher correlation. But the 
emphasis on correlation can be sometimes misleading as 
correlation measures the linear dependence and a perfect 
non-linear dependence between two variables can be ignored 
by correlation analysis. 

5.5. Type 5 Example: Non-Linear Optimization Model to 
Integrate Transcriptomic and Proteomic Data 

Garica et al. [77] implemented stochastic Gradient 
Boosting Tree (GBT) approach to infer non-linear relation- 
ships between mRNA and protein expression data and esti- 
mate the missing protein expressions using the generated 
relationship. They had mRNA expression data of around 
3500 genes and protein expression data of around 800 genes 
of Desulfovibrio vulgaris. After locating the non-linear rela- 
tionship and the missing protein expression values, they val- 
idated the result using knowledge from literature. The total 
procedure is shown in Fig. (3). 
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Fig. (3). Flowchart of the method used by Garcia et al. [77]. 

Gradient Boosting Tree Method: Gradient boosting ap- 
proach was proposed by Jerome H. Friedman in 2001 [116]. 
It is a nonlinear regression technique that produces a predic- 
tion model in the form of decision trees. Some modification 
of this method has been proposed by Friedman in 2002 [78]. 
The method uses a training set (xj,yj), (x >y^) that tries 

to find an approximation G(x) to a function G(x) that mini- 
mizes the expected value of some specified loss function 
<p(y,G(x)). 

G M = argmin E y jp( y ,G(x)) 

C(-r) 

Several loss criteria can be used including least squares: 
2 

q>(y,G)=(y-G) , least absolute deviation: 
cp(y,G)= ly-Gl etc. Garcia et al. used least squares criteria. 

The total procedure was divided into a iterations. Each 
iteration is called a regression tree. In each tree, pseudo re- 
siduals of the dependent variable (here, proteomic expres- 
sion) of the training dataset were located. Then the total in- 



put space has been divided into (3 disjoint regions using least 
square splitting criterion [87]. These regions are called the 
leaves of the tree. In each region, a multiplier value is 

calculated; here a denotes the a^ 1 tree and (3 denotes the 
leaf of that tree. 

Thus, the approximation function G(x) mainly depends on 
the splitting variables and splitting points for each tree and 
also on the value of n in each leaf. Using the training dataset, 
these variables are estimated and used in future prediction. 

Algorithm 3 demonstrates how Gradient Boosting Tree 
method works. Here the modified version by Friedman 
which he called TreeBoost is shown. Friedman introduced 
another parameter £ (0<i;<l) which controls the learning rate 
of the algorithm. This is called 'shrinkage parameter'. In 
each tree of each iteration, the value of u is multiplied by the 
value of According to Friedman [116], choice of the value 
of ^ is important for the performance of the algorithm; small 
values cause less prediction error. 

Stochastic Gradient Boosting Tree: A small change in 
the gradient boosting tree method can make it stochastic. For 
Stochastic GBT, a random subset of training dataset is used 
in each iteration rather than using the total training dataset. 
According to Friedman, the incorporation of randomness and 
the use of training data subset improve the performance of 
prediction as well as reduce computational complexity. 

Algorithm 3: Algorithm for implementation of Gradient 
Boosting Tree method (modified version 'TreeBoost') 

a) c„W = argmin2'' =l < ?'(.' / ,^)' 

s 

for a=l:a 

(i) Compute pseudo residuals of the dependent variable of 
the training dataset: 



y ia = 



dG(x,) 



for i=\,...,n 

(ii) Divide the training data space into [3 different 
regions R^a 7 ^2a"'^$a us ^ n 8 pseudo residuals. Least 
square splitting criterion is used to split the region. 
(Hi) Compute multiplier for each region b 

(£>£0,l,2...p) by solving the following optimization: 

it, = argmin^ ^ vO/.^v, (■*,■ ) + v) 
'i 

(iv) Update the model: 
P 

G m (x)=G m-l (x)+ ^ ^\a I{xeR ba )7 where I( ) is the indi " 



b=\ 



cator function. 

endfor 
b) Output F a (x). 



Integrated Analysis of Transcriptomic and Proteomic Data 



Current Genomics, 2013, Vol. 14, No. 2 101 



Validation process: The validation of the predicted miss- 
ing values is important for performance analysis. Garcia et 
al. used existing biological knowledge to validate their re- 
sults. The biological knowledge of Desulfovibrio vulgaris 
includes (i) functional categories of all genes (20 categories 
found from Comprehensive Microbial Resource [117]), (ii) 
sequence length, protein length, molecular weight, guanine- 
cytosine and triple codon counts of all gene, (Hi) a total of 
609 operons of Desulfovibrio vulgaris consisting of 2 to 13 
genes in each operon (iv) Regulons of Desulfovibrio vulgaris 
and (v) 92 metabolic pathways (KEGG pathways) for micro- 
bial genomes. 

The Coefficient of Variation (CV) for different operons 

and regulons are calculated using the predicted values of 

proteins by dividing the standard deviation (SD) by mean 

expression value of each operon/regulon/pathway group. Let 

an operon or a regulon or a pathway group has n genes in it; 

a random set consisting n genes was created and CV 

(CV , ) was calculated for that set of genes. This pro- 
random ° r 

cess was repeated for 1000 times for each oper- 
on/regulon/pathway and mean for CV , was calculated. 
° r J random 

The mean CV , was then compared against the CV of 
random r ° 

the original operon. The idea is that the variability of genes 
in operons or regulons will be less than the variation of ran- 
dom set of genes because the genes in an operon or regulon 
are supposed to be expressed together and are relationship in 
their expressions. If the CV of an operon is found to be less 

than the mean CV , , then it is concluded that, the pre- 
random' r 

dieted values of that operon is somehow close to accurate. 
The results found by Garcia et al. shows that a large portion 
of the operons/regulons/pathway groups indeed has less var- 
iability than the variability of randomly created groups of 
gene. Another measure used in this study for understanding 
the less dispersion of gene expressions in oper- 
ons/regulons/pathways is 'percentile score'. It's the percent- 
age of 1000 set of random genes for each oper- 
on/regulon/pathway group which have CV values 

(CV , ) less than the CV of original oper- 

random' b r 

on/regulon/pathway. Thus, the percentile score is expected to 
be less to prove that the variability in expression of genes in 
an operon/regulon/pathway is less dispersed than the random 
set of genes. 

Correlation between protein and mRNA expression was 
measured for each operon/regulon/pathway groups and also 
for all the genes. It was found that the correlation was 
stronger in most of the individual operons and pathway 
groups than the correlation for all genes. Small fraction of 
the regulon groups showed better correlation. 

The method applied for the validation process of this 
study clearly gives an idea about the overall prediction accu- 
racy but does not guarantee a good prediction. A poor pre- 
diction for all the genes in an operon might produce CV less 

than the mean CV , if the overall predictions for other 
random r 

operons are also similarly poor. A different option for valida- 
tion may be combination of this approach and incorporation 



of a testing dataset for cross-validation. The testing dataset 
will be a set of mRNA and protein datasets other than the 
training data whose actual values are also known. However, 
it will obviously reduce the size of training dataset. The pre- 
dicted values can be compared with the original values along 
with the method of validation using biological knowledge. 

5.6. Type 6 Example: Linear Regression Model to Inte- 
grate Transcriptomic and Proteomic Data 

In a study on Desulfovibrio vulgaris by Nie et al. [81], 
the effect of sequence features in different translational stag- 
es on the correlation of mRNA expression and protein abun- 
dance has been discussed. Multiple regression analysis that 
has been previously discussed in another paper by Nie et al. 
[118] was applied to predict the contribution of different 
sequence features on the correlation of mRNA and protein 
abundance. 

Sequence features in translational stages: A sequence 
feature 16 can be defined as an entity or data located in DNA 
or RNA sequences that are responsible for different biologi- 
cal phenomena. For example, Shine Dalnargo sequence is a 
sequence feature which is mainly a ribosomal binding site in 
mRNA and it helps the ribosome to start synthesis of protein. 
Other examples of sequence features can be start codon, stop 
codon, codon usage etc. In prokaryotes, translation can be 
divided into 3 stages: initiation of translation, elongation of 
translation and termination of translation. Lithwick et al. 
[64] demonstrates hierarchy of sequence features related to 
prokaryotic translation. Shine Dalgarno sequences, start co- 
don identity and start codon context are examples of initia- 
tion feature; codon usage and amino acid usage are examples 
of elongation feature; stop codon identity and stop codon 
context are examples of termination feature. 

Multiple Regression Analysis: A simple regression analy- 
sis can be expressed through the following equation: 

Protein .=A+B><mRNA . 
i i 

The target is to find A and B that relates the two varia- 
bles. Here, mRNA^ and Protein^ are logarithm of the mRNA 

and protein value of gene i respectively. Nie et al. [118] re- 

2 

ported that only 20-28% (Pearson correlation coefficient R ) 
of protein variability can be captured by simple regression 
analysis. This is because, protein abundance is not only re- 
lated to corresponding mRNA abundance but also depends 
on other different biological and chemical factors (termed as 
'covariate'). So multiple regression analysis is required 
which can be expressed as: 



Protein .=A+(mRNA .xfi)+ ^ (B .x Covariate . .) 

7=1 

where Protein, and mRNA. are the protein abundance data 

th 

and the mRNA expression level for the i gene respectively. 



' http://www.ncbi.nlm.nih.gov/IEB/ToolBox/SDKDOCS/SEQFEAT.HTML 
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th th 
Covariate.j refers to the j covariate of the i gene. Bj rep- 

th 

resents the slope for the j covariate. Nie et al. [118] found 

2 

that 52-61% (Pearson correlation coefficient, R ) variability 
of protein can be captured by this multiple regression analy- 
sis. 

In this study of effect of sequence features, Nie et al. [81] 
used the sequence features in different translational stages as 
covariates and performed multiple regression analysis to 
locate the sequence features that has the highest effect on the 
mRNA-protein correlation. They have done multiple regres- 
sion analysis for each type of sequence feature (i.e. sequence 
features related to initiation stage, elongation stage and ter- 
mination stage) separately and also for a combination of se- 
quence features. The results showed that the sequence fea- 
tures are significantly responsible for the variation in 
mRNA-protein correlation. And also mRNA-protein correla- 
tion was affected the most by elongation stage features. The 
method of finding the effect of covariates in mRNA-protein 
correlation using one of the 3 datasets can be visualized by 
the flowchart shown in Fig. (4). 



Sequence features data 
(covariate) 




Termination 
covariates 




Analysis of effect of 
sequence features 
on correlation of 
mRNA and Protein 




Fig. (4). Finding effect of sequence features on mRNA-protein 
correlation by multiple regression analysis. 

Three different datasets containing transcriptomic and 
proteomic measurements were used in this sequence feature 
analysis of Desulfovibrio vulgaris. The three datasets were 
expression levels under three different growth conditions 
(lactate, formate and lactate-stationary). Partial correlation 

sequence features in the variability. The partial correlation 
coefficient can be interpreted as: 

2 2 
2 R 2 R l 

R = 7T 

P - -2 



l-R 



where R^ is the Pearson correlation coefficient using only 
2 

simple regression; Rr^ is the Pearson correlation coefficient 

when multiple regression model is used with sequence fea- 
tures included. Standard F-test [119] was used to examine 
the significance (P-value for the F-test) of each covariate. 

2 

The Rp for different sequence features varied; for 'SD 

sequence': 1 .9%— 3.8%, for sum of 'start codon': 
0.1%-0.7%, for 'start codon context': 0.3%-2.6%, for sum 
of 'codon usage': 5.3%— 15.7%, for sum of 'Amino acid us- 
age': 5.8%-11.9%, for sum of 'stop codon': 1.3%-2.3%, for 
sum of 'stop codon context': 3.7%-5.1%. 

2 

The sum of the individual R values for all the sequence 

2 

features are ranged in 21.8%-39.8% where the R for se- 
quence features together in a single multiple regression 
ranged in 15.2%-26.2%. 

So, the analysis proved that, among multiple sequence 
features, 'amino acid usage' and 'codon usage' are the top 
factors that affect the mRNA-protein correlation. The results 
were validated by conducting similar analysis where se- 
quence features were kept same for all the genes but protein 
values were randomly assigned to the mRNA values. The 
resulting P-value in this validation stage analysis was found 
to be less than the original P-values which indicates better 
statistical significance of the model found. 

Pointing out the factors affecting the mRNA-protein corre- 
lation was the major contribution of this study. These results 
can be utilized in creating robust model for mRNA and protein 
expression values. This is another proof of the fact that only 
mRNA expression does not necessarily have the power of 
predicting the protein expression. Complex biological factors 
such as sequence features related to translational stages should 
have a significant role in their prediction procedure. 

5.7. Type 7 Example: Correspondence Between Tran- 
scriptomic and Proteomic Expression Profiles Using 
Coupled Cluster Models 

The mRNA or protein expression of a random set of genes 
is likely to show multiple different levels of expression, but 
genes involved in similar functions or having similar effects on 
cellular regulation might show close expression levels. A mix- 
ture model [120] generally clusters such datasets into a prede- 
fined number of sub-sets in an unsupervised manner. For exam- 
ple, Gaussian mixture model is an unsupervised clustering algo- 
rithm where it is able to create soft boundaries among the clus- 
ters, i.e. points in the space can be present in any cluster defined 
by a given probability. This is primarily a mixture of a certain 
number of Gaussian distributions with unknown parameters 
where each Gaussian distribution fits its corresponding cluster. 
Estimation maximization (EM) algorithm [121] is used to find 
the parameters of the Gaussian distribution and the cluster prob- 
abilities. 

Simon Rogers et al. [82] proposed a coupled mixture 
model to investigate the correspondence between tran- 
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scriptomic and proteomic expressions. The dataset consisted 
of transcriptomic and proteomic profiles of 542 human genes 
from the Human Mammary Epithelial Cell line (HMEC). 
Measurements were taken between 0 and 24h after the cells 
were stimulated with Epidermal Growth Factor (EGF). There 
were a total of 6 transcriptomic (mRNA) measurements (1 
hr, 4 hr, 8 hr, 13 hr, 18 hr, 24 hr) and 7 proteomic (proteins) 
measurements (15', 1 hr, 4 hr, 8 hr, 13 hr, 18 hr, 24 hr). 

The mRNA and protein datasets were clustered individu- 
ally using Gaussian mixture model and the similarity be- 
tween the two sets of clusters were determined by standard 
Rand index [122]. The standard Rand index is ranged from 0 
to 1 where 1 denotes that the two cluster sets are exactly 
same. It was observed in the study that the two cluster sets 
showed very little similarity. The large dissimilarity suggest- 
ed that if the two datasets were clustered after concatenating 
them into a single dataset, the number of clusters could have 
been as large as 20*15=300 which was impractical as the 
total number of genes was 542. Also, by comparing gene 
ontology (GO) enrichment analysis, it was discovered that 
the individual clustering produced different biologically 
meaningful clusters which were lost when clusters were cre- 
ated after concatenation. The failure of individual and con- 
catenated clustering lead to the implementation of coupled 
mixture models described in this study. The ideas of cluster- 
ing individually and clustering after concatenation are illus- 
trated in Figs. ((5 and (6) respectively. 



mRNA 
data set 



Protein 
data set 



Clustering Clustering 



Clustered 
data 



v 



Clustered 
data 



Further Analysis 



o 




Find similarity 
measures between 
mRNA and protein 
clusters 



Fig. (5). Method for clustering individually. 




mRNA data set 
(T time points) 



Protein data set 
(T time points) 



Concatenation 



Concatenated data 
(2T time points) 



Clustering into 
Gaussian mixture 



concatenated data 



Further Analysis 





Compare properties 
of clusters with 
properties of 
individual clusters 



Fig. (6). Method for clustering after concatenation. 

In coupled mixture modeling, the mRNA dataset was 
clustered into 'IT different clusters with p(u)(uEl,2...U) de- 
noting the probability that mRNA expression of a gene be- 
th 

longs to the u cluster. Similarly protein dataset was clus- 
tered into 'V different clusters with p(v)(v€l,2...V) denoting 
the probability that protein expression of a gene belongs to 
th 

the v cluster. The joint probability can be described as 
p(u,v)=p(u)p(v\u) where p(y\u) is the parameter that provides 
the relationship between mRNA expression and protein lev- 
el. 

The EM algorithm was used to maximize a log- 
likelihood function (equation 1 in supplementary material of 
[82]) to infer the desired parameters. The number of clusters 
(U and V) in each dataset was derived to be f/=15 and V=20 
using Bayesian Information Criterion (BIC, proposed by 
Gideon E. Schwarz [123]). Fig. ((7) demonstrates the cou- 
pled mixture model. 

The values of p(v\u) can unravel important information 
about the complexity of the relationship between mRNA and 
protein expressions. For example, the protein cluster v=4 had 
a total of 19 proteins in it, 18 of those were ribosomal pro- 
teins. There were 7 mRNA clusters which had positive 
p(u\v=4) within the protein cluster v=4. The most connected 
mRNA cluster with this protein cluster was the cluster u=3 
because /?(m=3Iv=4)=0.3653 (which was the highest among 
all p(u\v=4)). If we look at the other protein clusters which 
are related to this mRNA cluster u=3, we'll see that, there 
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are 14 protein clusters which have positive p(v\u=3). This 
complex set of information suggests that indeed the relation- 
ship between mRNA and protein expression is a complex 
one; this decision could not be made with the results of indi- 
vidual clustering and concatenated clustering technique. The 
inference of complex relationships from those conditional 
probabilities remains open; may be use of more biological 
knowledge about the involvement of different sets of mRNA 
and proteins in different biological processes will reveal the 
relationship more clearly. Rogers et al. concentrated on three 
biological phenomena and others are still open problems. The 
three biological phenomena they dealt with are: (i) conserved 
behavior of ribosomes occurring at the protein level, (ii) dis- 
covering interesting set of genes involved in cell-adhesion and 
(iii) The role of TCP- 1 as a protein folding machine. 




Bayesian 
->( Information K- 
Criterion 



Number of mRNA clusters 
(K) and protein clusters (J) 



Initialize parameters of coupled 
mixture model including p(k), p(jlk) 



mRNA data set 
(Dimension T) 



Protein data set 
(Dimension T) 




Clustered 
mRNA dataset 



Conditional 
probabilities p(j Ik) 
and p(k) 



Clustered 
protein dataset 



Fig. (7). Coupled clustering method used by Rogers et al. [82]. 

The choice of the model in each component of the mix- 
ture is not limited to Gaussian. Other models can be used to 
construct mixture models such as ordinary differential equa- 
tion model used by Chudova et al. [124] and B-splines used 
by Luan and Li [125]. 

We should note that the mRNA and protein expression of 
542 genes used in Rogers et al. was a subset of the original 
dataset that had a lot of missing values for proteomic expres- 
sion. The genes that had both transcriptomic and proteomic 
values present were used in the study. An interesting idea 
may be to use missing value prediction method described in 
section 5.5 (method by Garcia et al. [77]) to complete the 
original dataset and use that in this study. Combining the 
approaches by Garcia et al. [77] and Rogers et al. [82] can 
possibly improve mRNA and protein correspondence when 
missing values possess a significant issue. 



5.8. Type 8 Example: Dynamic Models 

Nariai et al. [88] used cell cycle microarray data [126] of 
Saccharomyces cerevisiae and 9030 protein-protein interac- 
tion data derived from MIPS database [127] to construct a 
Bayesian network model. The authors proved that the use of 
p-p interaction data had refined the estimated gene network 
produced by using only microarray data. The algorithm used 
to construct the network can be simply illustrated by the 
flowchart showed in Fig. ((8). The algorithm is designated as 
the greedy hill-climbing algorithm. 



Calculate BNRC score after performing 
(i) add a parent (ii) remove a parent (iii) 
reverse parent child (iv) nothing 




Construct protein complex 
with gene i and the added gene 
using PCA. Is the protein 
complex works better as a 
parent of each child of gene i 
and the added gene than the 
gene i or the added gene, keep 
the complex in network 
otherwise ignore it 




Fig. (8). The greedy hill-climbing algorithm for finding and model- 
ing protein complexes and estimating a gene network. 

In the greedy hill-climbing algorithm, each network was 
evaluated by Bayesian network and Nonparametric Regres- 
sion Criterion (BNRC) score [88]. Parents of each gene 
(genes that regulate a gene are called parents of that gene) 
were determined using this algorithm; the parents can be 
protein complex or other genes. Principal component analy- 
sis was used to find the protein complexes that were in- 
volved in regulating certain genes. Three gene networks 
were estimated: (i) by using only microarray data (ii) by us- 
ing only p-p interaction data and (iii) by using the greedy 
hill-climbing algorithm. The three networks were then com- 
pared with the KEGG compiled network for evaluation. The 
network edges agreed with the KEGG pathway used for 
comparison. By using 350 chosen genes from the MIPS 
functional category mitotic cell cycle and cell cycle control, 
34 protein complexes were discovered in this study (22 of 
these 34 complexes are listed in MIPS complex catalog). 
Also, incorporating phase information of cell cycle (e.g. 
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Gl/S, S, M phase) revealed biologically important relation- 
ships of several genes that are not included in the KEGG 
pathway. 

It is well established that inferring GRNs from mRNA 
data alone has a huge computational complexity and often 
lack in accuracy. Thus, a number of studies [88-90] used 
prior biological knowledge to reduce the complexity and/or 
incorporated other type(s) of data to increase the accuracy of 
the inference process. But to our knowledge, no such study 
has been done that uses protein abundance data along with 
transcriptomic data to infer a GRN which can reveal a dy- 
namic relationship model between transcriptomic and prote- 
omic network. 

6. DISCUSSION 

Studies have shown that there exists a poor correlation 
between mRNA expression and protein abundance [1-5]. 
Some possible reasons based on protein half-lives, post tran- 
scription machinery etc. has been proposed. It is interesting 
to locate analogies between biological scenarios and other 
physical scenarios so that approaches used for the analysis of 
one can throw insights and be possibly used for the analysis 
of the other. For instance, Wang et al. [128] compared the 
gene-mRNA-protein structure with computer's internal 
structure and proposed a theory that can be used to verify the 
reason behind the poor correlations. 

On a similar note, we propose that a gene-transcriptome- 
proteome network has a number of similarities with an or- 
ganizational command structure. We next show the relations 
between the two using a military command system. 

The military headquarter can be assumed as the main 
data processor and command center during war time. Head- 
quarter send commands to the base camp that actually con- 
trols the on-field troops. The on-field troops directly take 
part in the operation. Command and direction of the head- 
quarter can be viewed as the gene sequence in DNA which 
encodes the proteins. The base camp command can be 
viewed as the transcriptome and the on-field troops can be 
viewed as the proteome. The factors that affect any on-field 
troop in the operation can be viewed as metabolites and other 
external conditions. A troop may not be always able to ex- 
actly perform as commanded and sometimes the base camp 
cannot pass on the exact command from the head quarter due 
to on-field scenario. Furthermore, the base camp may have a 
different strategy to implement the commands relayed from 
the headquarter resulting in a delay in the implementation of 
the headquarter command. Thus, reflection of any command 
may not be instantaneous. Similarly, transcriptomic exist- 
ence does not guarantee instantaneous or even delayed ver- 
sion of proteomic abundance. 

Furthermore, a command from the headquarters may not 
require completion depending on the on-field situation and 
instantaneous thinking. Thus, delay of the command propa- 
gation plays an important role. Similarly, the delay in creat- 
ing a protein from gene-mRNA-protein (central dogma) sys- 
tem may cause the desired protein 'not needed' at all. 

The above mentioned analogy can be supported by the 
fact that the command may be adaptively changed based on 
the needs of the troop or success i.e. there is a feedback from 
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the troop which may alter the war-plan. Similarly, feedback 
from outside factors and proteins can control the 'on' and 
'off mechanism of genes. Thus, to understand the biological 
mechanism and associated network, we need to have a de- 
tailed idea of how the proteins and mRNAs react to outside 
stimulations and how the commands from the genes are ne- 
glected resulting in a poor correlation between transcriptom- 
ic and proteomic data. In brief, we need to broaden our view 
just from transcriptomic abundance and proteomic abun- 
dance and consider an integrated transcriptomic -proteomic 
approach incorporating other factors such as external condi- 
tions, metabolites etc. The relation between transcriptomic 
and proteomic domain can be better understood if a time 
series of gene and protein expression for single cells are 
available for a good length of time with high sampling fre- 
quency. 

As mentioned earlier, similarities between different do- 
mains allow us to apply techniques developed for one do- 
main in another and also provide unique viewpoints for un- 
derstanding the system behavior. Since biological networks 
are extremely complex and large-scale, a natural question 
arises whether we can relate them to other complex networks 
such as social networks, communication networks, web 
graphs etc. [129-131]. The recent development of online 
social networks offers an analogy between the molecular 
biology networks and social networks. The social networks 
can be considered to have two primary domains: one of them 
consists of the network of physical relations as manifested 
through verbal communications in workplaces, schools, 
neighborhood etc. and the other is the network of online rela- 
tions through social media such as facebook, Linkedln, twit- 
ter, Wikipedia, play station networks etc. We can associate 
the transcriptomic domain as the physical relation network 
and the proteomic domain as the social media network. The 
portion of individuals participating in social media can be 
considered as the protein coding genes. The social media 
network and the physical social networks are both connected 
and can have very similar communities just like related set of 
mRNAs and proteins are involved in specific functions. The 
commands of protein generation through mRNAs are similar 
to individuals expressing their views on social media. There 
will be links in physical social networks and links in virtual 
social networks and multitude of cross-links between these 
two networks. A number of views of an individual may not 
be instantly reflected in the social media due to surrounding 
physical situations, delay in reaching the device to post the 
message or disturbances in the online network. Expressions 
of emotions in the physical world are quick and can change 
fast similar to mRNAs that are mostly transient. Due to the 
vast memory of online interactions, emotion expressed in the 
social media remains for longer time similar to the scenario 
of generated proteins being in the system for longer time. 
Assuming this scenario, we can ask questions such as does 
the problem of learning the actual emotions of individuals of 
the social network by asking online questions equivalent to 
understanding the biological regulatory mechanism by study- 
ing protein abundance alone? In a social network, an indi- 
vidual's bad disposition can affect the moods of closely 
linked people around him/her (such as members of his/her 
community in the social network) just like RNAs can affect 
other surrounding RNAs. However, the manifestation of an 
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individual's gloomy state of mind in the social media net- 
work may not have the same effect as observed in the physi- 
cal social network whereas some other expressions in the 
online world can influence more links in the online and 
physical networks as compared to the expression propagated 
through the physical social network. In a similar fashion, 
some proteins can have much more effect on surrounding 
proteins under specific conditions as compared to the effect 
of the corresponding mRNA on its surrounding mRNAs. As 
a single time snapshot is not sufficient to detect how infor- 
mation is spreading through a social network, single time 
snapshots of mRNAs and Proteins are not suitable for under- 
standing the biological machinery. Time series data of 
mRNAs and Proteins for individual cells are required to get 
better understanding of the interactions of the transcriptomic 
and proteomic domains. 

7. CONCLUSIONS 

As compared to existing reviews [8, 7, 6, 10] on joint 
transcriptomic and proteomic profiling, the current article 
focuses on uncovering the primary categories of approaches 
that have been proposed for fusion of transcriptomic and 
proteomic data. We have divided the existing methods into 
eight main categories and illustrated each by specific exam- 
ple of studies. For a researcher searching for ways to com- 
bine a set of transcriptomic and proteomic profiles, this re- 
view provides a concise overview of the existing analysis 
techniques categorized into eight types and the advantages 
and limitations of the various approaches. For further in- 
sights, we provide analogies of the transcriptomic and prote- 
omic expression scenario with cases in large scale organiza- 
tional and social networks. This can possibly allow design of 
methodologies for joint analysis of mRNA and Protein ex- 
pression data based on fusion techniques applied in other 
large scale network analysis. 
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